<HTML>
<!-- (c) Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009 Hewlett-Packard Development Company, LP -->


<HEAD>
<TITLE>Jena Relational Database backend</TITLE>
<link href="../styles/doc.css" rel="stylesheet" type="text/css" />
</HEAD>



<BODY bgcolor="#FFFFFF">



<h1>Jena2 Database Interface - Database Layout</h1>



<p>This document provides some details on the Jena2 database schema and the
encoding used to store URIs and literal values.</p>
<h2>Overview - Denormalized Triple Store</h2>
<p>A widely-used scheme for storing RDF statements in a relational database is
the <i>triple store</i>. In this approach, each RDF statement is stored as a
single row in a three column 'statement' table. Typically, a fourth column
is added to indicate if the object is a literal or a URI. A common variation of
this scheme which uses much less storage space is the <i>normalized triple store</i>
approach. This scheme uses a statement table plus a literals table and a resources table.</p>
<p>The literals table stores the literals for all statements and the resources table stores all the
resources from all the statements. The statement table stores the
subject, predicate and object but instead of storing the values directly, it
stores references to the values  in the resources and literals tables.  This
normalized scheme uses less space than the standard
triple store approach since a literal value or resource URI is only stored once,
regardless of the number of times it occurs in statements. The space saving
comes at a cost, however, since retrieving the subject, predicate and object
values for a statement requires a 3-way join of the statement, literals and
resources tables.&nbsp; Jena1 used a normalized triple store approach.</p>
<p>Jena2 stores RDF statements using a <i>denormalized</i> triple store approach
which is a hybrid of the standard triple store and the normalized triple store.
This scheme uses a statement table, a literals table and resources table as
before. However, the statement table may contain either the values themselves or
references to  values in the literals and resources tables. 'Short' literals
are stored directly in the statement table and 'long'
literals are stored in the literals table. Similarly short URIs are stored directly in the statement
table and long URIs are stored in the resources table. The length threshold
for short vs. long is configurable but the default length is 256.</p>
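<p>The short/long decision described above can be sketched as follows. This is
only an illustration: the function name <code>storage_location</code> and the
table labels are hypothetical, the boundary case (a value exactly as long as the
threshold) is assumed to count as short, and the length is measured on the raw
value rather than on its encoded form.</p>

```python
# Hypothetical sketch of the short-vs-long storage decision.
LONG_OBJECT_LENGTH = 256  # default threshold given in the text; configurable


def storage_location(value: str, kind: str, threshold: int = LONG_OBJECT_LENGTH) -> str:
    """Return the table that would hold the value itself.

    kind is "literal" or "uri". Long values go to the literals or resources
    table (the statement table then holds only a reference); short values are
    stored directly in the statement table.
    """
    if kind not in ("literal", "uri"):
        raise ValueError("kind must be 'literal' or 'uri'")
    if len(value) > threshold:  # assumption: strictly longer means 'long'
        return "literals" if kind == "literal" else "resources"
    return "statement"


print(storage_location("hello", "literal"))
print(storage_location("http://www.example.org/" + "x" * 300, "uri"))
```

<p>Raising the threshold keeps more values in the statement table (faster
retrieval, more duplication); lowering it moves more values into the long
tables.</p>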
<p>The motivation for the Jena2 schema is to combine the space savings of the
normalized approach with the retrieval efficiency (i.e., no joins) of the
standard triple store. In Jena2, long literals and long URIs are stored only
once. Short values may be stored multiple times but, since they are short in
length, the additional storage space required is deemed acceptable.</p>
<p>It is hoped that most Jena2 retrieval operations can be accomplished by
accessing just the statement table. The retrieval operation returns all statements
that match a value for subject, predicate and object. So
long as the value to be matched is a short literal or URI, the retrieval can be
performed with a simple SQL select on the statement table. However, if the match value is a long value,
then the literals or resources tables must first be searched for the identifier
of the matching value. Then, the statement table is searched for the
identifier of the matching value rather than the value itself.</p>
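<p>The two retrieval paths can be illustrated with a small, self-contained
sketch using an in-memory SQLite database. Everything here is a simplification
for illustration: the table and column names (<code>stmt</code>, <code>long_lit</code>),
the helper functions, and the tiny threshold are assumptions, the long table
holds the whole value in one column, and the <code>Lv:</code>/<code>Lr:</code>
tags omit the language and datatype fields of the real encoding.</p>

```python
import sqlite3

THRESHOLD = 8  # tiny value for the demo; the text gives 256 as the default


def open_store() -> sqlite3.Connection:
    """Create a simplified statement table and long-literals table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stmt (subj TEXT NOT NULL, prop TEXT NOT NULL,
                           obj TEXT NOT NULL, graph_id INTEGER);
        CREATE TABLE long_lit (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
        """
    )
    return conn


def encode_object(conn, literal, create):
    """Return the string stored in stmt.obj, or None if a long value is unknown."""
    if len(literal) > THRESHOLD:
        # Long value: the long table must be consulted first.
        row = conn.execute(
            "SELECT id FROM long_lit WHERE value = ?", (literal,)
        ).fetchone()
        if row is None:
            if not create:
                return None
            cur = conn.execute("INSERT INTO long_lit (value) VALUES (?)", (literal,))
            return f"Lr:{cur.lastrowid}"
        return f"Lr:{row[0]}"
    # Short value: stored inline, no second table involved.
    return f"Lv:{literal}"


def add_statement(conn, subj, prop, literal, graph_id=1):
    obj = encode_object(conn, literal, create=True)
    conn.execute("INSERT INTO stmt VALUES (?, ?, ?, ?)", (subj, prop, obj, graph_id))


def find_by_object(conn, literal):
    """Return (subj, prop) pairs of statements whose object equals the literal."""
    obj = encode_object(conn, literal, create=False)
    if obj is None:
        return []  # long value never stored, so no statement can match
    return conn.execute("SELECT subj, prop FROM stmt WHERE obj = ?", (obj,)).fetchall()


conn = open_store()
add_statement(conn, "ex:s1", "ex:p", "short")
add_statement(conn, "ex:s2", "ex:p", "a much longer literal value")
print(find_by_object(conn, "short"))
print(find_by_object(conn, "a much longer literal value"))
```

<p>The short lookup touches only the statement table; the long lookup needs one
extra query against the long table before the statement table is searched.</p>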
<p>In terms of a space-time trade-off, the Jena2 approach uses extra storage space
in order to save
on response time. However, the user can configure the threshold for short vs.
long values (see <i>LongObjectLength</i> in <a href="options.html">Options</a>) and so adjust the space-time trade-off to the needs of the
application. More details on the Jena2 storage subsystem are available in the
Hewlett-Packard Laboratories Technical Report <i>Efficient RDF Storage and
Retrieval in Jena2</i>, HPL-2003-266.</p>
<h2>Statement Tables</h2>
<p>Jena2 uses two types of statement tables, one type for asserted statements
and a second type for reified statements. The benefit of the reified statement
table is that it stores reified statements in an optimized form. Recall that a
reified statement is expressed in RDF as four individual RDF statements. Storing
this would require four rows in a standard triple store. With a reified
statement table, a reified statement can be stored as a single row. For
applications that use a large number of reified statements, the space savings
can be substantial.</p>
<p>Note that, by default, each graph (model) is stored in its own pair of
statement tables (one for asserted statements, one for reified statements). However, Jena2 allows models to share statement tables as well
(see <i>StoreWithModel</i> in <a href="options.html">Options</a>). The table
layouts for the various Jena2 tables are sketched below. <i>Varchar(n)</i>
denotes a variable-length string with a maximum length of <i>n</i> characters (<i>n</i>
is configurable). The encoding
used for the variable-length character string columns is described later in this
document.</p>
<p><b>Asserted Statement Table</b> (Jena_G<i>i</i>T<i>j</i>_Stmt)</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber7">
  <tr>
    <td style="text-align: left" width="55"><b>Column</b></td>
    <td style="text-align: left" width="129"><b>Type</b></td>
    <td style="text-align: left" width="461"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Subj</td>
    <td style="text-align: left" width="129">Varchar(n) not null</td>
    <td style="text-align: left" width="461">Subject of asserted statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Prop</td>
    <td style="text-align: left" width="129">Varchar(n) not null</td>
    <td style="text-align: left" width="461">Predicate of asserted statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Obj</td>
    <td style="text-align: left" width="129">Varchar(n) not null</td>
    <td style="text-align: left" width="461">Object of asserted statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">GraphId</td>
    <td style="text-align: left" width="129">Integer</td>
    <td style="text-align: left" width="461">Identifier of graph (model) that
    contains the asserted statement</td>
  </tr>
</table>
<p>Indexes: non-unique index on &lt;Subj,Prop&gt;; non-unique index on &lt;Obj&gt;</p>
<p>These tables hold asserted (non-reified) statements for one or more graphs. The table name is generated and has the form Jena_G<i>i</i>T<i>j</i>_Stmt
where <i>i</i> is a graph identifier and <i>j</i> is a table counter for the
graph, e.g., Jena_G1T1_Stmt. </p>
<p><b>Reified Statement Table</b> (Jena_G<i>i</i>T<i>j</i>_Reif)</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber8">
  <tr>
    <td style="text-align: left" width="58"><b>Column</b></td>
    <td style="text-align: left" width="128"><b>Type</b></td>
    <td style="text-align: left" width="459"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">Subj</td>
    <td style="text-align: left" width="128">Varchar(n)</td>
    <td style="text-align: left" width="459">Subject of reified statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">Prop</td>
    <td style="text-align: left" width="128">Varchar(n)</td>
    <td style="text-align: left" width="459">Predicate of reified statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">Obj</td>
    <td style="text-align: left" width="128">Varchar(n)</td>
    <td style="text-align: left" width="459">Object of reified statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">GraphId</td>
    <td style="text-align: left" width="128">Integer</td>
    <td style="text-align: left" width="459">Identifier of graph (model) that
    contains the reified statement</td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">Stmt</td>
    <td style="text-align: left" width="128">Varchar(n) not null</td>
    <td style="text-align: left" width="459">Identifier (URI) of reified
    statement (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="58">HasType</td>
    <td style="text-align: left" width="128">Char(1) not null</td>
    <td style="text-align: left" width="459">'T' if the graph (model) contains
    the statement (Stmt, rdf:type, rdf:Statement), else ' '</td>
  </tr>
</table>
<p>Indexes: unique index on &lt;Stmt,HasType&gt;; non-unique index on &lt;Subj,Prop&gt;;
non-unique index on &lt;Obj&gt;<br>
&nbsp;</p>
<p>These tables hold reified statements for one or more graphs. The table name is generated and has the form Jena_G<i>i</i>T<i>j</i>_Reif
where <i>i</i> is a graph identifier and <i>j</i> is a table counter for the
graph, e.g., Jena_G1T2_Reif. A row of the reified statement table, say<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ex:subj, ex:prop, ex:obj, 1, ex:stmt,
'T'<br>
would represent the following four asserted statements in the graph (model) with
identifier 1:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ex:stmt, rdf:subject, ex:subj .<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ex:stmt, rdf:predicate, ex:prop .<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ex:stmt, rdf:object, ex:obj .<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ex:stmt, rdf:type, rdf:Statement .</p>
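<p>The expansion of one reified-table row into individual statements can be
sketched as below, using the standard RDF reification property names
(rdf:subject, rdf:predicate, rdf:object, rdf:type). The function name is
hypothetical, and the handling of empty columns is an assumption based on the
Subj, Prop and Obj columns being nullable in the table layout.</p>

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
ReifRow = Tuple[Optional[str], Optional[str], Optional[str], int, str, str]


def expand_reified_row(row: ReifRow) -> List[Triple]:
    """Expand (Subj, Prop, Obj, GraphId, Stmt, HasType) into plain triples.

    GraphId only says which graph the triples belong to, so it is not part
    of the returned triples.
    """
    subj, prop, obj, _graph_id, stmt, has_type = row
    triples: List[Triple] = []
    if subj is not None:  # assumption: NULL means that part is absent
        triples.append((stmt, "rdf:subject", subj))
    if prop is not None:
        triples.append((stmt, "rdf:predicate", prop))
    if obj is not None:
        triples.append((stmt, "rdf:object", obj))
    if has_type == "T":
        triples.append((stmt, "rdf:type", "rdf:Statement"))
    return triples


for triple in expand_reified_row(("ex:subj", "ex:prop", "ex:obj", 1, "ex:stmt", "T")):
    print(triple)
```

<p>A fully reified statement therefore occupies one row here instead of four
rows in an asserted statement table.</p>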
<p><br>
&nbsp;</p>
<h2>System Tables</h2>
<p>The Jena2 system tables store metadata as well as the long values for
literals and resources in the statement tables. As before, <i>Varchar(n)</i>
denotes a variable-length character string with a maximum length of <i>n</i>
characters.&nbsp; <i>Blob</i> denotes a very long character string whose maximum
length depends on the database engine. Typically, database engines do not
support indexes on blob columns.</p>
<p><b>System Statement Table</b> (Jena_Sys_Stmt)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber1">
  <tr>
    <td style="text-align: left" width="70"><b>Column</b></td>
    <td style="text-align: left" width="140"><b>Type</b></td>
    <td style="text-align: left" width="435"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="70">Subj</td>
    <td style="text-align: left" width="140">Varchar(n) not null</td>
    <td style="text-align: left" width="435">Subject of metadata statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="70">Prop</td>
    <td style="text-align: left" width="140">Varchar(n) not null</td>
    <td style="text-align: left" width="435">Predicate of metadata statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="70">Obj</td>
    <td style="text-align: left" width="140">Varchar(n) not null</td>
    <td style="text-align: left" width="435">Object of metadata statement
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="70">GraphId</td>
    <td style="text-align: left" width="140">Integer</td>
    <td style="text-align: left" width="435">Always zero, representing the
    system (meta) graph</td>
  </tr>
</table>
<p>Indexes: non-unique index on &lt;Subj,Prop&gt;</p>
<p>The system statement table is used to store system metadata for the Jena2
storage subsystem such as configuration parameters, table names for graphs,
etc. The metadata is expressed as RDF statements, so this table looks very much
like an asserted statement table.</p>
<p>
<br>
<b>Long Literals Table</b> (Jena_Long_Lit)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber2">
  <tr>
    <td style="text-align: left" width="55"><b>Column</b></td>
    <td style="text-align: left" width="131"><b>Type</b></td>
    <td style="text-align: left" width="459"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Id</td>
    <td style="text-align: left" width="131">Integer not null</td>
    <td style="text-align: left" width="459">Identifier of long literal,
    referenced from the statement tables</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Head</td>
    <td style="text-align: left" width="131">Varchar(n) not null</td>
    <td style="text-align: left" width="459">First n characters of long literal
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">ChkSum</td>
    <td style="text-align: left" width="131">Integer</td>
    <td style="text-align: left" width="459">Checksum of tail of long literal</td>
  </tr>
  <tr>
    <td style="text-align: left" width="55">Tail</td>
    <td style="text-align: left" width="131">Blob</td>
    <td style="text-align: left" width="459">Remainder of long literal (long
    literal without the head)</td>
  </tr>
</table>
<p>Indexes: unique index on &lt;Head,ChkSum&gt;<br>
Primary Key: Id</p>
<p>The long literals table stores literals that are too long to store directly
in a statement table. Each long literal is assigned a unique integer identifier
and this identifier is used to reference the literal from a statement table. To
support indexing of long literals, the first <i>n</i> characters of the long
literal are stored in the Head column. The length of the head is configurable
(see <i>IndexKeyLength</i> in <a href="options.html">Options</a>). The remainder
of the long literal is stored as a blob in the Tail column. A checksum is
computed over the tail to distinguish long literals with the same Head.<br>
<br>
<b>Long Resources Table</b> (Jena_Long_URI)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber3">
  <tr>
    <td style="text-align: left" width="53"><b>Column</b></td>
    <td style="text-align: left" width="132"><b>Type</b></td>
    <td style="text-align: left" width="460"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="53">Id</td>
    <td style="text-align: left" width="132">Integer not null </td>
    <td style="text-align: left" width="460">Identifier of long URI, referenced
    from the statement tables</td>
  </tr>
  <tr>
    <td style="text-align: left" width="53">Head</td>
    <td style="text-align: left" width="132">Varchar(n) not null</td>
    <td style="text-align: left" width="460">First n characters of long URI
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="53">ChkSum</td>
    <td style="text-align: left" width="132">Integer</td>
    <td style="text-align: left" width="460">Checksum of tail of long URI</td>
  </tr>
  <tr>
    <td style="text-align: left" width="53">Tail</td>
    <td style="text-align: left" width="132">Blob</td>
    <td style="text-align: left" width="460">Remainder of long URI (long URI
    without the head)</td>
  </tr>
</table>
<p>Indexes: unique index on &lt;Head,ChkSum&gt;<br>
Primary Key: Id</p>
<p>The long resources table stores URIs that are too long to store directly in a
statement table. Each long URI is assigned a unique integer identifier and this
identifier is used to reference the URI from a statement table. The columns of
the long resources table are similar to those in the long literals table.
However, the encoding of URIs is different from that of long literals. In
particular, URIs may have prefixes. See the  prefixes table and the encoding
discussion, below.</p>
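<p>The head/checksum/tail split used by the long tables can be sketched as
follows. The function name and the head length of 250 are hypothetical (the
real length comes from <i>IndexKeyLength</i>), and UTF-8 is assumed for turning
the tail into bytes before computing the CRC32 value.</p>

```python
import zlib

INDEX_KEY_LENGTH = 250  # hypothetical head length; configurable in practice


def split_long_value(encoded: str, head_len: int = INDEX_KEY_LENGTH):
    """Split a long value into (Head, ChkSum, Tail) column values.

    Head is the indexed leading part, Tail is the remainder stored as a blob,
    and ChkSum is a CRC32 over the tail so that values sharing the same head
    can still be told apart through the <Head, ChkSum> index.
    """
    head = encoded[:head_len]
    tail = encoded[head_len:]
    # Assumption: no checksum is stored when there is no tail.
    chksum = zlib.crc32(tail.encode("utf-8")) if tail else None
    return head, chksum, tail


head, chksum, tail = split_long_value("a" * 300)
print(len(head), len(tail), chksum)
```

<p>Two long values with identical heads but different tails end up with the
same Head and, in all but rare collision cases, different ChkSum values.</p>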
<p><br>
<b>Prefixes Table</b> (Jena_Prefix)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber4">
  <tr>
    <td style="text-align: left" width="52"><b>Column</b></td>
    <td style="text-align: left" width="132"><b>Type</b></td>
    <td style="text-align: left" width="461"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="52">Id</td>
    <td style="text-align: left" width="132">Integer not null </td>
    <td style="text-align: left" width="461">Identifier of  prefix,
    referenced from the statement tables</td>
  </tr>
  <tr>
    <td style="text-align: left" width="52">Head</td>
    <td style="text-align: left" width="132">Varchar(n) not null</td>
    <td style="text-align: left" width="461">First n characters of  prefix
    (encoded)</td>
  </tr>
  <tr>
    <td style="text-align: left" width="52">ChkSum</td>
    <td style="text-align: left" width="132">Integer</td>
    <td style="text-align: left" width="461">Checksum of tail of long prefix</td>
  </tr>
  <tr>
    <td style="text-align: left" width="52">Tail</td>
    <td style="text-align: left" width="132">Blob</td>
    <td style="text-align: left" width="461">Remainder of long prefix (long
    prefix without the head)</td>
  </tr>
</table>
<p>Indexes: unique index on &lt;Head,ChkSum&gt;<br>
Primary Key: Id</p>
<p>URIs often have common prefixes (e.g., <code>http://www.example.org/prop1</code>,
<code>http://www.example.org/prop2</code>). There can be substantial space savings if
prefixes are stored just once. Jena2 will optionally store common URI prefixes in the  prefixes
table (see <i>DoCompressURI </i> in <a href="options.html">Options</a>). This
table is structurally similar to the long literals and long resources tables.</p>
<p><br>
<b>Graph Table</b> (Jena_Graph)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber5">
  <tr>
    <td style="text-align: left" width="51"><b>Column</b></td>
    <td style="text-align: left" width="105"><b>Type</b></td>
    <td style="text-align: left" width="489"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" width="51">Id</td>
    <td style="text-align: left" width="105">Integer not null</td>
    <td style="text-align: left" width="489">Unique identifier for graph</td>
  </tr>
  <tr>
    <td style="text-align: left" width="51">Name</td>
    <td style="text-align: left" width="105">Blob</td>
    <td style="text-align: left" width="489">Graph name</td>
  </tr>
</table>
<p>Primary Key: Id</p>
<p>The graph table stores the name and unique identifier for each user graph. An
unnamed graph appears under the name DEFAULT.</p>
<p><br>
<b>Lock Table</b> (Jena_Mutex)<br>
&nbsp;</p>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="679" id="AutoNumber6" height="54">
  <tr>
    <td style="text-align: left" height="25" width="51"><b>Column</b></td>
    <td style="text-align: left" height="25" width="50"><b>Type</b></td>
    <td style="text-align: left" height="25" width="544"><b>Description</b></td>
  </tr>
  <tr>
    <td style="text-align: left" height="26" width="51">Dummy</td>
    <td style="text-align: left" height="26" width="50">Integer</td>
    <td style="text-align: left" height="26" width="544">Unused</td>
  </tr>
</table>
<p>The lock table is used internally by Jena2 to implement a critical section. It has no meaningful
content. If the table exists, the database is locked for
an internal critical section, e.g., creating a model. Normally, the lock is released
at the end of the critical section by deleting the table. But, if a Jena application fails while
in a critical section, i.e., while holding the lock, users may have to manually release the lock either by deleting the table or calling DriverRDB.unlockDB().<br>
<br>
&nbsp;</p>
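<p>The table-as-lock idea can be sketched with an in-memory SQLite database.
This is not Jena code: the helper names (<code>critical_section</code>,
<code>lock_held</code>, <code>force_unlock</code>) and the exception class are
hypothetical, and only the principle is shown, namely that creating the table
acquires the lock and dropping it releases the lock.</p>

```python
import sqlite3
from contextlib import contextmanager


class DatabaseLocked(Exception):
    """Raised when the lock table already exists."""


@contextmanager
def critical_section(conn: sqlite3.Connection):
    try:
        # Creating the table fails if it already exists, i.e. the lock is held.
        conn.execute("CREATE TABLE Jena_Mutex (Dummy INTEGER)")
    except sqlite3.OperationalError as exc:
        raise DatabaseLocked("lock table already exists") from exc
    try:
        yield
    finally:
        # Normal release: delete the table at the end of the critical section.
        conn.execute("DROP TABLE Jena_Mutex")


def lock_held(conn: sqlite3.Connection) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'Jena_Mutex'"
    ).fetchone()
    return row is not None


def force_unlock(conn: sqlite3.Connection) -> None:
    """Manual release after a crash left the lock table behind."""
    conn.execute("DROP TABLE IF EXISTS Jena_Mutex")


conn = sqlite3.connect(":memory:")
with critical_section(conn):
    print("inside critical section, lock held:", lock_held(conn))
print("after critical section, lock held:", lock_held(conn))
```

<p>If the process dies inside the <code>with</code> block of a real, persistent
database, the table stays behind, which is the situation that requires the
manual release described above.</p>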
<h2>Value Encoding in Statement and Long Tables</h2>
<p>The following describes the encoding used for literals, URIs and other values
in the statement tables and
the system tables. Square brackets delimit optional components of the
encoding. Italics are used for variable values.</p>
<p><u><font size="2">Literal Encoding in Statement Tables</font></u><br>
Short Literal. <b>Lv:</b>[<i>langLen</i>]<b>:</b>[<i>datatypeLen</i>]<b>:</b>[<i>langString</i>][<i>datatypeString</i>]<i>value</i>[<b>:</b>]<br>
Long Literal. <b>Lr:</b><i>dbid</i><br>
<br>
<u><font size="2">Literal Encoding in Long Literal Table</font></u><br>
Literal. <b>Lv:</b>[<i>langLen</i>]<b>:</b>[<i>datatypeLen</i>]<b>:</b>[<i>langString</i>][<i>datatypeString</i>]<i>head</i>[<b>:</b>]
<i>hash</i> <i>tail</i><br>
<br>
<u>Legend</u><br>
<b>L</b> indicates a literal<br>
<b>v</b> indicates a value<br>
<b>r</b> indicates a reference to another table<br>
<b>:</b> is used as a delimiter. Note that MySQL trims trailing white space for
certain Varchar columns so an extra delimiter is appended when necessary for
those columns.<br>
<i>dbid</i> is an (integer) identifier of an entry in the long literals table<br>
<i>langLen</i> is the length of the language identifier for the literal<br>
<i>langString</i> is the language identifier<br>
<i>datatypeLen</i> is the length of the datatype for the literal<br>
<i>datatypeString</i> is the datatype for the literal<br>
<i>value</i> is the lexical form of the literal<br>
<i>head</i> is the initial substring of the literal that is indexed<br>
<i>hash</i> is the CRC32 hash value for the tail<br>
<i>tail</i> is the remainder of the literal that is not  indexed<br>
<br>
<br>
<br>
<u>URI Encoding in Statement Tables</u><br>
Short URI. <b>Uv:</b>[<i>pfx_dbid</i>]<b>:</b><i>URI</i>[<b>:</b>]<br>
Long URI. <b>Ur:</b>[<i>pfx_dbid</i>]<b>:</b><i>dbid</i></p>
<p><u>URI Encoding in Long URI Table</u><br>
URI. <b>Uv:</b><i>head</i>[<b>:</b>] <i>hash</i> <i>tail</i><br>
<br>
<u>Legend</u><br>
<b>U</b> indicates a URI<br>
<i>pfx_dbid</i> is an (integer) identifier of an entry in the prefixes table. If the prefix is too
short, i.e., the length of the prefix is less than URI_COMPRESS_LENGTH (see <a href="options.html">Options</a>), the URI is not compressed and
<i>pfx_dbid</i> is null.<br>
<i>URI</i> is the complete URI<br>
other notation same as for literal encoding<br>
<br>
<u>Blank Node Encoding in Statement Tables</u><br>
Short Blank Node. <b>Bv:</b>[<i>pfx_dbid</i>]<b>:</b><i>bnid</i>[<b>:</b>]<br>
Long Blank Node. <b>Br:</b>[<i>pfx_dbid</i>]<b>:</b><i>dbid</i></p>
<p><u>Blank Node Encoding in Long URI Table</u><br>
Blank Node. <b>Bv:</b><i>head</i>[<b>:</b>] <i>hash</i> <i>tail</i><br>
<br>
<u>Legend</u><br>
<b>B</b> indicates a blank node<br>
<i>bnid</i> is the blank node identifier<br>
other notation same as above<br>
Note: currently, blank nodes are always stored uncompressed (<i>pfx_dbid</i> is null).
<br>
<br>
<u>Variable Node Encoding in Statement Tables</u><br>
Variable Node. <b>Vv:</b><i>name</i><br>
<br>
<u>Legend</u><br>
<b>V</b> indicates a variable node<br>
<b>v</b> indicates a value<br>
<i>name</i> is the variable name<br>
Note: the length must be less than LONG_OBJECT_LENGTH<br>
<br>
<u>ANY Node Encoding in Statement Tables</u><br>
ANY Node. <b>Av:</b><br>
<br>
<u>Prefix Encoding in Prefix Table</u><br>
Prefix. <b>Pv:</b><i>val</i>[<b>:</b>] [<i>hash</i>] [<i>tail</i>]<br>
<br>
<u>Legend</u><br>
<b>P</b> indicates a prefix<br>
other notation same as above<br>
<i>hash</i> and <i>tail</i> are only required for long prefixes.<br>
&nbsp;</p>
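<p>As a worked illustration of the statement-table encodings above, the sketch
below builds and parses short literals and builds an uncompressed short URI.
The function names are hypothetical. It assumes that an absent language or
datatype is written as an empty length field, and it leaves out the optional
trailing delimiter and prefix compression.</p>

```python
from typing import Tuple


def encode_short_literal(value: str, lang: str = "", datatype: str = "") -> str:
    """Build Lv:[langLen]:[datatypeLen]:[langString][datatypeString]value."""
    lang_len = str(len(lang)) if lang else ""
    datatype_len = str(len(datatype)) if datatype else ""
    return f"Lv:{lang_len}:{datatype_len}:{lang}{datatype}{value}"


def decode_short_literal(encoded: str) -> Tuple[str, str, str]:
    """Return (value, lang, datatype) for a string made by encode_short_literal."""
    if not encoded.startswith("Lv:"):
        raise ValueError("not a short literal encoding")
    # Only the first two delimiters after the tag are structural; any later
    # ':' belongs to the datatype or the value and is located by length.
    lang_len_s, datatype_len_s, rest = encoded[3:].split(":", 2)
    lang_len = int(lang_len_s) if lang_len_s else 0
    datatype_len = int(datatype_len_s) if datatype_len_s else 0
    lang = rest[:lang_len]
    datatype = rest[lang_len:lang_len + datatype_len]
    value = rest[lang_len + datatype_len:]
    return value, lang, datatype


def encode_short_uri(uri: str) -> str:
    """Build Uv:[pfx_dbid]:URI with no prefix compression (pfx_dbid empty)."""
    return f"Uv::{uri}"


print(encode_short_literal("hello"))
print(encode_short_literal("hello", lang="en"))
print(encode_short_uri("http://www.example.org/prop1"))
```

<p>Storing the lengths of the language and datatype strings, rather than
delimiting them, means the value and datatype may themselves contain the
<b>:</b> character without ambiguity.</p>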



</BODY>



</HTML>